ABSTRACT

The distributions of galaxy properties vary with environment, and are often
multi-modal, suggesting that the galaxy population may be a combination of
multiple components. The behaviour of these components versus environment
holds details about the processes of galaxy development. To release this
information we apply a novel, nonparametric statistical technique,
identifying four components present in the distribution of galaxy Hα
emission-line equivalent-widths. We interpret these components as passive,
star-forming, and two varieties of active galactic nuclei. Independent of
this interpretation, the properties of each component are remarkably
constant as a function of environment. Only their relative proportions
display substantial variation. The galaxy population thus appears to
comprise distinct components which are individually independent of
environment, with galaxies rapidly transitioning between components as they
move into denser environments.

* E-mail: steven.bamford@nottingham.ac.uk

1 COMPONENTS OF THE GALAXY POPULATION

It has long been recognised that galaxies may be divided into at least two
distinct sub-populations. Originally this division was based on visual
appearance. Most galaxies can be morphologically classified as either
elliptical or spiral. Finer classification is possible, discretizing an
apparently continuous variation in galaxy appearance. However, the dichotomy
between elliptical and spiral morphology is more pronounced than the
variations within each class. Subsequently, it has been discovered that
several other, more quantitative, galaxy properties are distributed unevenly
or in a multi-modal manner.

The colour distribution of SDSS galaxies is strongly bimodal (Strateva et al.
2001). Galaxies in the "red" and "blue" modes can be roughly identified as
those with elliptical and spiral morphology, respectively (Hogg et al. 2002;
Driver et al. 2006). Whereas morphology reflects the dynamical state of
galaxies, colour is related to their star-formation history, particularly
over the last < 10^9 years. The colour bimodality thus implies a division of
the galaxy population into blue galaxies, which have recently formed stars,
and red galaxies, which have not. Such a bimodality in the star-formation
properties of galaxies has also been observed using more direct measures of
current star-formation, such as emission-line strength (Balogh et al. 2004).

The positions of the red and blue galaxy sequences in the colour-luminosity
or colour-stellar mass planes display only a weak dependence on environment.
However, the relative proportions of galaxies in the two sequences vary
strongly. In regions with a higher local galaxy density the fraction of
galaxies on the red sequence is higher (Balogh et al. 2004; Baldry et al.
2004, 2006).

It remains a matter of debate whether colour is more closely related to
environment than morphology. Some claim that trends in morphology versus
environment can be mostly explained via a morphology-colour relation which is
almost independent of environment (Weinmann et al. 2006; Blanton et al. 2006;
Ball et al. 2006; Wolf et al. 2007). However, other studies oppose this view
(Park et al. 2007), and it has been clearly shown that the colour and
morphology bimodalities behave differently with respect to environment and
stellar mass (Bamford et al. 2008).

There are growing indications that, in a fraction of 
the galaxy population, star-formation must be terminated 
rapidly (Balogh et al. 2004; Baldry et al. 2006). The emis- 
sion lines in galaxy spectra provide a way of measuring the 
level of current star formation on a timescale of < 10^7 years.
They therefore trace rapid star formation variations more 
sensitively than colour. Another important property of emis- 
sion lines is that they are produced by active galactic nuclei 
(AGN), in addition to star formation. AGN are present in 
many galaxies, and are thought to be produced by accretion 
of material onto the super-massive black holes which appear 
to reside at the centre of most, if not all, galaxies (Richstone 
et al. 1998). Recently a variety of studies have suggested that 
AGN strongly influence star-formation in their host galax- 
ies, and thus play an important role in defining the galaxy 
population (Kauffmann et al. 2003; Silk 2005; Croton et al. 
2006; Bower et al. 2006). The potential presence of an AGN 
contribution complicates the traditional usage of emission-
lines as an indicator of star formation rate (SFR). However,
it also presents an opportunity to study these two inter- 
dependent processes, star-formation and AGN, through the 
distribution of a single quantity. 

Galaxies with contrasting properties are found to be 
distributed differently in space. Elliptical galaxies cluster 
together more strongly than spirals (Beisbart & Kerscher 
2000; Giuricin et al. 2001). Similarly, red galaxies are pref- 
erentially found in denser environments than blue galaxies 
(Zehavi et al. 2005). We have a well developed theory for 
how structure forms in the cosmos, at least in terms of the 
underlying cold dark matter which dominates the mass den- 
sity (Springel et al. 2005). Baryonic matter is expected to 
be similarly distributed, in broad terms. This theory thus 
explains the range of galaxy environments observed. However,
how galaxy properties vary with environment is a much more
complicated issue, governed by the detailed physics of galaxy
formation and evolution. By studying trends in the galaxy
population with environment we can learn about these physical
processes.

There has been a logical progression in studies of galaxy 
properties as a function of environment. Early work was 
based on dividing galaxies into simple classes and looking 
at variations in the fractions of galaxies of each class in bins 
of environment (Dressler et al. 1985). As galaxy samples
grew, this moved on to examining trends in the mean prop- 
erties of galaxies as a smooth function of local galaxy density 
(Lewis et al. 2002; Gomez et al. 2003). A significant devel- 
opment was fitting to the data functions that describe the 
distribution of galaxies in two classes (Balogh et al. 2004; 
Baldry et al. 2004, 2006). Most of the approaches employed 
so far have relied upon enforcing a predefined view of how 
to divide or classify the galaxy population in increasingly 
complex ways. However, our understanding of the physical 
processes at work is highly uncertain and does not provide a 
sufficient basis to make this decision. Our only guide is the 
data itself. A natural next step is thus to turn to nonpara- 
metric methods, where the components of the population 
are deduced consistently from the data itself. 

Recently, several studies have performed multivariate 
statistical analyses on datasets containing a wide variety 
of galaxy properties, in order to identify components of
the galaxy population, and determine which properties are
most important for identifying to which component a galaxy 
belongs (Ellis et al. 2005; Conselice 2006). Such studies 
are highly informative, but become complicated when one 
wishes to determine the behaviour of the identified com- 
ponents versus another variable. In this paper we are pri- 
marily concerned with variation in the components of the 
galaxy population as a function of environment. The sta- 
tistical method we present below may be straightforwardly 
applied to multivariate datasets. However, for simplicity, in 
the present work we consider the environmental dependence 
of just one galaxy property. Nevertheless, even with this el- 
ementary approach, we are able to learn much about the 
galaxy population. 



2 CONDITIONAL DENSITY ESTIMATION 

A common problem in astronomy, and statistical sciences 
in general, is that one wishes to understand how the be- 
haviour of one variable depends upon another. This is rel- 
atively straightforward in the case where there is a single 
relationship between the variables, albeit with some, possi- 
bly variable, scatter or width to the distribution. Much sta- 
tistical and astronomical literature has been devoted to the 
development of such regression methods (Weisberg 2005). 
However, in the case where multiple components may be 
present in the overall distribution, each with a different func- 
tional dependence on the variables, the situation becomes 
substantially more difficult. One can still attempt to apply 
single-component statistical tools, for example nonparamet- 
ric quantile regression (Koenker & Bassett 1978; Yu & Jones 
1998), on the whole distribution, but the understanding one 
gains from such an exercise is limited and sometimes mis- 
leading. Alternatively one may individually analyse subsam- 
ples selected by defining regions in the parameter space, or 
preferably using additional information (Cherkassky & Ma
2005). This approach, however, is unsuitable when the mul- 
tiple components significantly overlap, or when it is unclear 
how many components are present. 

Most regression techniques focus on estimating the con- 
ditional mean, the average value of one variable as a func- 
tion of another variable; for example, a line through a set 
of scattered points. However, one may get a better under- 
standing of the relationship between a response variable and 
a set of covariates by considering the estimation of the con- 
ditional density as a whole; the distribution of one variable 
as a function of another. (Note that density here refers to 
probability density as a function of the parameter set, not a 
measure of environmental local galaxy density as elsewhere 
in this paper.) We use a new conditional density estimator 
based on finite mixture models and local likelihood estima- 
tion, which describes the underlying relationship between 
two variables by a set of parameterised functions. This fea- 
ture gives the proposed procedure the advantage of being 
easily interpretable. This method is called nonparametric 
mixture regression (NMR), and is described in detail in
Appendix A.
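
The combination of a finite mixture model with local likelihood estimation can be illustrated with a short sketch. This is not the implementation described in Appendix A, only a minimal example under stated assumptions: a fixed Gaussian kernel bandwidth in the covariate, Gaussian components, a plain EM iteration, and synthetic data. The function name `local_mixture_fit` and all parameter values are hypothetical.

```python
import numpy as np

def local_mixture_fit(x, y, x0, bandwidth, n_components, n_iter=200):
    """Kernel-weighted EM fit of a 1-D Gaussian mixture in y, localised at x = x0."""
    # Local likelihood: points with x close to x0 receive the largest weights.
    w = np.exp(-0.5 * ((x - x0) / bandwidth) ** 2)
    w /= w.sum()
    # Start the means at spread-out quantiles of y, with equal widths and proportions.
    mu = np.quantile(y, (np.arange(n_components) + 0.5) / n_components)
    sigma = np.full(n_components, y.std())
    pi = np.full(n_components, 1.0 / n_components)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each data point.
        z = (y[:, None] - mu) / sigma
        dens = pi * np.exp(-0.5 * z ** 2) / (sigma * np.sqrt(2.0 * np.pi))
        resp = dens / (dens.sum(axis=1, keepdims=True) + 1e-300)
        # M-step: kernel-weighted updates of proportion, location and width.
        wr = w[:, None] * resp
        nk = wr.sum(axis=0)
        pi = nk / nk.sum()
        mu = (wr * y[:, None]).sum(axis=0) / nk
        var = (wr * (y[:, None] - mu) ** 2).sum(axis=0) / nk
        sigma = np.sqrt(np.maximum(var, 1e-12))
    return pi, mu, sigma

# Synthetic test case: two components with fixed locations and widths,
# whose relative proportions change with the covariate x.
rng = np.random.default_rng(42)
n = 4000
x = rng.uniform(0.0, 1.0, n)
is_low = rng.uniform(size=n) < 0.2 + 0.6 * x
y = np.where(is_low, rng.normal(0.0, 0.3, n), rng.normal(2.0, 0.3, n))

pi_lo, mu_lo, sigma_lo = local_mixture_fit(x, y, x0=0.1, bandwidth=0.1, n_components=2)
pi_hi, mu_hi, sigma_hi = local_mixture_fit(x, y, x0=0.9, bandwidth=0.1, n_components=2)
print("x0 = 0.1:", pi_lo, mu_lo, sigma_lo)
print("x0 = 0.9:", pi_hi, mu_hi, sigma_hi)
```

Repeating the fit on a grid of `x0` values traces the location, width and proportion of each component as functions of the covariate, which is the kind of output a conditional density estimate of this type provides.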

The NMR technique has the potential to aid the understanding of many
datasets, across all fields of science. In the present work, it allows us to
determine the environmental dependence for individual components of the
galaxy population, with minimal prior assumptions on the number and
properties of these components.

© 2008 RAS, MNRAS 000, 1-11

Figure 1. The distribution of transformed Hα equivalent width (W_Hα) for
(left) low and (right) high density environments. The histogram displays the
data, with Poisson uncertainties indicated by the grey shading. The red,
purple, green and blue lines show the components derived by applying the NMR
technique. The brown line gives the sum of these components, which is clearly
a good representation of the data.



3 GALAXY Hα EQUIVALENT WIDTHS

The strongest emission line in a galaxy optical spectrum is Hα. The
luminosity of Hα is approximately proportional to the rate of ongoing
star-formation (Moustakas et al. 2006), when uncontaminated by additional
emission, such as from an AGN. A commonly employed quantity is the equivalent
width (EW) of a spectral line, the line flux normalised by the continuum flux
at the same wavelength. The EW measurement has the advantages of being
approximately independent of uncertainties in the spectral flux calibration
and any extinction present in both the observed galaxy and our own. The Hα
line is in the red region of the spectrum, where the continuum is dominated
by the light from old stars. The Hα continuum flux is therefore roughly
proportional to stellar mass, and hence EW(Hα) is approximately proportional
to the SFR per unit stellar mass.

The overall distributions of galaxy Hα luminosity and equivalent width, and
hence absolute and normalised SFR, are found to move to lower levels with
increasing environmental density (Lewis et al. 2002; Gomez et al. 2003). This
generally agrees with the colour and morphology trends described above, and
the variation of Hα emission with morphological type (Nakamura et al. 2004).
However, if the galaxy population is separated into galaxies which are
star-forming and those which are not, the distribution of EW(Hα) for each
component does not depend significantly on environment. Only the relative
proportion of star-forming galaxies changes strongly with environment (Balogh
et al. 2004). This finding, of distinguishable components in the galaxy
population with properties independent of environment but proportions which
vary strongly, mirrors the behaviour found in the colour distribution. It
also motivates us to perform a more rigorous evaluation of the components
present in the galaxy population in this work.

As mentioned earlier, an important feature of emission 
lines is that, in addition to star formation, they are also 
produced by AGN. Galaxies whose emission lines are dom- 
inated by star-formation or AGN activity can be separated 
using various diagnostic diagrams. The most common of 
these plots the emission line ratios [OIII]λ5007/Hβ versus
[NII]λ6583/Hα, and is known as the BPT diagram (Baldwin
et al. 1981). The usual approach is to use these diagrams
to reject objects inappropriate to the particular study.
Thus a study of galaxy star formation properties would ex- 
clude all galaxies with signs of AGN contamination. How- 
ever, classifying a galaxy using the BPT diagram requires 
multiple emission lines to be detected, resulting in a fraction 
of objects which cannot be classified. In addition, the sep- 
aration between galaxies dominated by star-formation and 
AGN is not clear, and there appears to be a large popu- 
lation of galaxies which host both star-formation and an 
AGN. Roughly 20% of all galaxies are unambiguously AGN- 
dominated, while it is estimated that a further 20% are star- 
forming galaxies with a significant AGN contribution (Miller 
et al. 2003). This ambiguity means a variety of SFR-AGN 
demarcations exist (Kewley et al. 2001; Kauffmann et al. 
2003; Stasinska et al. 2006). Star formation studies based 
on emission lines have therefore rejected widely varying frac- 
tions of galaxies from their samples. This fraction is usually 
low, so significant numbers of AGN-contaminated galaxies 
remain. More importantly, if our aim is to gain knowledge
of star-formation properties across the whole galaxy population,
then we may be rejecting an important fraction of the
population. If there are any intrinsic correlations between
AGN and star-formation, as has been suggested by other
studies (Kauffmann et al. 2003), then information about
these will be lost.

Figure 2. As Fig. 1, but shown here in terms of the untransformed equivalent
width, EW(Hα). The inset shows the same plot with axis-ranges chosen to
better show the behaviour at small EW(Hα).

A number of classes of AGN have been identified. A 
primary distinction is between Type 1 and Type 2 AGN. In 
Type 1 objects our viewing angle is such that we see the re- 
gion immediately around the central black hole directly, and 
thus the galaxy's light is dominated by the AGN emission. 
In this case the properties of the host galaxy are generally 
very difficult to determine. In Type 2 AGN, the central re- 
gion is obscured by a dusty torus surrounding it. The ob- 
served AGN emission is therefore due to material further re- 
moved from the central ionising source, and mostly confined 
to emission lines. Most photometric and structural galaxy 
properties may therefore be reliably measured, despite the 
presence of a Type 2 AGN. In this work we exclude all Type 
1 AGN, identified by the large widths of their emission lines, 
and consider only the more common Type 2 objects. A fur- 
ther subdivision within Type 2 AGN is between LINER and 
Seyfert 2 objects. These are similar, and may simply be two 
parts of a continuum of objects, with Seyfert 2 AGN be- 
ing more powerful and highly ionised. However, there are 
signs that LINERs and Seyfert 2 AGN are truly physically 
distinct classes (Kewley et al. 2006). 

In this work we examine the components in the distribution of galaxy EW(Hα),
interpretable as a proxy for star formation rate and nuclear activity per
unit stellar mass. It is possible to estimate the true star-formation rate
and stellar mass, for galaxies which do not host an AGN, using a combination
of several spectral features. However, such estimates are sensitive to the
details of the assumed model. There is therefore a concern that any finding
concerning the components of the resulting distribution may be attributable
to the model. The EW(Hα), on the other hand, is a single, robust,
model-independent measurement.

The data we use in our study is from Data Release 4 of the SDSS
(Adelman-McCarthy et al. 2006). The emission line fluxes, continua and
resulting EW used in this study are those provided for DR4 by the
MPA-Garching group^1 (Tremonti et al. 2004). All quantities used in this
paper were obtained from the CMU-PITT SDSS DR4 Value Added Catalog^2 (VAC).
The SQL code for the selection of each of our samples is given in Table 1. We
construct a volume-limited sample by selecting galaxies with 0.05 < z < 0.095
and M_r < -20.4. In this work we thus focus on the behaviour of fairly bright
galaxies. The lower redshift limit ensures the spectra are based on a
reasonable fraction of the galaxies' light; at z = 0.05 the 3 arcsec diameter
of each spectroscopic fibre corresponds to 3 kpc. Throughout we convert to
physical scales assuming a flat Friedmann-Robertson-Walker cosmology with
Ω_m = 0.3, Ω_Λ = 0.7 and H_0 = 70 km s^-1 Mpc^-1.
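
The quoted fibre scale can be checked with a few lines of arithmetic. The sketch below assumes the flat cosmology stated above and integrates the comoving distance numerically with a simple midpoint rule; the function name `fibre_size_kpc` is illustrative only.

```python
import math

C_KM_S = 299792.458  # speed of light in km/s

def fibre_size_kpc(z, theta_arcsec, h0=70.0, omega_m=0.3, omega_l=0.7, steps=1000):
    """Physical size (kpc) subtended by theta_arcsec at redshift z, flat FRW cosmology."""
    # Comoving distance: (c / H0) * integral of dz' / E(z'), using the midpoint rule.
    dz = z / steps
    integral = sum(
        dz / math.sqrt(omega_m * (1.0 + (i + 0.5) * dz) ** 3 + omega_l)
        for i in range(steps)
    )
    d_comoving_mpc = C_KM_S / h0 * integral
    # In a flat universe the angular diameter distance is D_C / (1 + z).
    d_angular_mpc = d_comoving_mpc / (1.0 + z)
    theta_rad = math.radians(theta_arcsec / 3600.0)
    return d_angular_mpc * theta_rad * 1000.0  # Mpc -> kpc

size = fibre_size_kpc(0.05, 3.0)
print(f"3 arcsec at z = 0.05 corresponds to {size:.2f} kpc")
```

The result is close to 3 kpc, consistent with the value given in the text.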



4 MEASURING GALAXY ENVIRONMENT 

Galaxy environment can be characterised in many ways, 
but a commonly adopted value is the local number density 
of galaxies brighter than a given luminosity, averaged over 
some volume or kernel. We estimate the local galaxy number
density, ρ, within a fixed-scale, spherical kernel with a
Gaussian radial profile and bandwidth b. Our local galaxy 
densities are thus simple to interpret physically. 
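
As an illustration, a Gaussian-kernel number density can be computed directly from galaxy positions. The sketch below assumes a three-dimensional Gaussian kernel normalised to unit volume integral and includes each galaxy's own contribution; both choices are assumptions made here, not details confirmed by the text. Under them, an isolated galaxy receives a density of (2π b^2)^(-3/2), which for b = 1.3 Mpc is about 0.029 Mpc^-3, similar to the resolution floor of roughly 0.03 Mpc^-3 discussed later in this section.

```python
import numpy as np

def local_density(positions, bandwidth):
    """Gaussian-kernel number density (per unit volume) at each galaxy position.

    positions: array of shape (n, 3), in the same length unit as bandwidth.
    Uses O(n^2) memory, so it is only suitable for small illustrative samples.
    """
    diff = positions[:, None, :] - positions[None, :, :]
    dist2 = (diff ** 2).sum(axis=-1)
    # Normalisation of a 3-D Gaussian kernel with standard deviation = bandwidth.
    norm = (2.0 * np.pi * bandwidth ** 2) ** -1.5
    return norm * np.exp(-0.5 * dist2 / bandwidth ** 2).sum(axis=1)

# Three galaxies in a tight group and one isolated galaxy (positions in Mpc).
positions = np.array([
    [0.0, 0.0, 0.0],
    [0.5, 0.0, 0.0],
    [0.0, 0.5, 0.0],
    [100.0, 100.0, 100.0],
])
rho = local_density(positions, bandwidth=1.3)
print(rho)  # the isolated galaxy only sees its own kernel contribution
```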

1 available from http://www.mpa-garching.mpg.de/SDSS/DR4

2 available from http://nvogre.phyast.pitt.edu/dr4_value_added

Table 1. Definitions of the galaxy samples used in this study, given as
'where' clauses of the SQL queries of the CMU-PITT SDSS DR4 VAC.

sample             SQL selection                              n
-----------------  -----------------------------------------  ------
density defining   z between 0.02 and 0.10 and                117873
sample             absolute_Petro_r <= -20.4 and Sort =

ρ_1.3 sample       z between 0.05 and 0.095 and               76420
                   absolute_Petro_r <= -20.4 and
                   2.4 < Dist_right_edge and
                   2.4 < Dist_left_edge and
                   2.4 < Dist_upper_edge and
                   2.4 < Dist_lower_edge and
                   H_ALPHA_FLUX > -99 and
                   H_ALPHA_CONT > 0.0001 and
                   H_ALPHA_FLUX/H_ALPHA_CONT > -0.4 and
                   absolute_Petro_u > -990 and
                   absolute_Petro_r > -990 and Sort =

ρ_5.5 sample       z between 0.05 and 0.095 and               46998
                   absolute_Petro_r <= -20.4 and
                   11 < Dist_right_edge and
                   11 < Dist_left_edge and
                   11 < Dist_upper_edge and
                   11 < Dist_lower_edge and
                   H_ALPHA_FLUX > -99 and
                   H_ALPHA_CONT > 0.0001 and
                   H_ALPHA_FLUX/H_ALPHA_CONT > -0.4 and
                   absolute_Petro_u > -990 and
                   absolute_Petro_r > -990 and Sort =

To select the bandwidth, or scale, b of the kernel, we apply leave-one-out
cross-validation; that is, we select the value of b which minimizes the
estimated integrated mean squared error, CV(b). This error is obtained by
estimating the density function n times, each time leaving out one galaxy
from the estimation:

$$ CV(b) = \int f_{n,b}^{2}(x)\,dx - \frac{2}{n}\sum_{i=1}^{n} f_{(-i),b}(X_i) \qquad (1) $$

where {X_i} is the set of galaxy positions, and f_{n,b} and f_{(-i),b} are
the kernel density estimators with bandwidth b, using all n galaxies and
after removing the i-th galaxy, respectively. We compute CV(b) for a range of
different bandwidth values to find that which minimizes the error. Applying
this cross-validation method we determine an optimum bandwidth value of
1.3 Mpc. A similar optimum bandwidth for local galaxy density estimation was
found using cross-validation by Balogh et al. (2004).
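
The leave-one-out cross-validation score can be sketched as follows. The example assumes the standard least-squares form of the score with a normalised Gaussian kernel, for which the integral of the squared estimate has a closed form (the convolution of two Gaussians of variance b^2 is a Gaussian of variance 2b^2). It is run on synthetic one-dimensional data rather than galaxy positions, and the names `gauss` and `cv_score` are illustrative.

```python
import numpy as np

def gauss(dist2, var, dim):
    """Isotropic Gaussian density with per-axis variance `var` at squared distance dist2."""
    return np.exp(-0.5 * dist2 / var) / (2.0 * np.pi * var) ** (dim / 2.0)

def cv_score(points, b):
    """Leave-one-out cross-validation score CV(b) for a Gaussian kernel density estimate."""
    n, dim = points.shape
    dist2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
    # First term: integral of the squared estimate, analytic for Gaussian kernels.
    integral_f2 = gauss(dist2, 2.0 * b ** 2, dim).sum() / n ** 2
    # Second term: estimate at each point with that point left out (drop the self term).
    k = gauss(dist2, b ** 2, dim)
    loo = (k.sum(axis=1) - np.diag(k)) / (n - 1)
    return integral_f2 - 2.0 * loo.mean()

# Synthetic sample: 400 points drawn from a unit Gaussian.
rng = np.random.default_rng(1)
points = rng.normal(size=(400, 1))
bandwidths = np.geomspace(0.01, 3.0, 60)
scores = np.array([cv_score(points, b) for b in bandwidths])
best_b = bandwidths[np.argmin(scores)]
print(f"bandwidth minimising CV(b): {best_b:.3f}")
```

Very small bandwidths are penalised by the first term, and very large ones by the loss of the second, so the score has an interior minimum on a sufficiently wide grid.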

Interestingly, this scale corresponds to the size of galaxy clusters, and is
thus highly appropriate for characterising density from a physical, as well
as a statistical, point of view. However, while cross-validation provides the
statistically optimum bandwidth for the whole sample, any choice of bandwidth
has its limitations. This density estimator loses resolution at low
densities, where there are no neighbouring galaxies within the kernel
bandwidth, and is thus unable to discriminate between densities lower than
ρ_1.3 ≈ 0.03 Mpc^-3, comprising 17% of the sample. In order to probe
environments less dense than this, but necessarily on larger physical scales,
we additionally perform the analysis with local densities measured using a
larger bandwidth of 5.5 Mpc. Almost all galaxies have a neighbour within this
radius. One could also consider estimating densities with a kernel bandwidth
significantly smaller than 1.3 Mpc. However, such an estimator would lose
resolution below even moderate densities, where galaxies are typically
separated by more than the bandwidth. It would also be less able to
discriminate between high density environments, because the densities are
estimated using galaxy positions uncorrected for redshift-space distortions,
and hence an increase in true-space density no longer results in a higher
redshift-space density within the kernel. We mostly show results based on the
statistically-motivated 1.3 Mpc bandwidth in the main body of this article,
but provide figures using the 5.5 Mpc bandwidth in Appendix B, to demonstrate
that we find similar results on larger scales and to lower densities.

We avoid biased density estimates for galaxies at the edges of our sample
volume by determining the densities using a larger volume sample of galaxies
with 0.02 < z < 0.10 and M_r < -20.4. We then limit the analysis sample to
galaxies with 0.05 < z < 0.095 and further than approximately twice the
bandwidth from a survey boundary. We reject a further 3% of galaxies with
unreliable EW(Hα) or (u - r) rest-frame colour measurements. The exact
selections, and corresponding sample sizes, are given in Table 1.



5 APPLYING THE NMR TECHNIQUE 

A brief inspection of the sample EW(Hα) distribution reveals a peak around
zero EW, with a long, asymmetric tail to high EW. The NMR technique is more
computationally efficient when using symmetrical, Gaussian functions to model
the distribution. Gaussians are also an obvious choice due to their
exceptional richness and flexibility. For convenience we therefore wish to
transform the equivalent width quantity to a space where its natural
components appear to take a more symmetrical, Gaussian, form. Better matching
the shape of the true distribution components to that assumed in the NMR
technique will also naturally result in fewer NMR components being required
to model the distribution (but see Appendix A). The EW extend slightly to
negative values, precluding a simple logarithmic transformation. We therefore
choose the transformation W_Hα = log10(EW(Hα) + A). The zero offset
parameter, A, must be large enough to make the logarithm argument positive
for the most negative EW value in our sample. In constructing our sample we
remove outliers by requiring EW(Hα) > -0.4, thereby clipping the lowest 0.1%
of the sample. Therefore, we must have A > 0.4. We have examined the
behaviour of our NMR fits and their likelihood with variations in A. The
chosen value has only a relatively small effect, slightly altering the shape
of the Gaussian basis functions once they are transformed back into EW space,
but not changing our results significantly. Here we adopt A = 1.4 as a
compromise between maximising the fit likelihood and ensuring stable
behaviour. We must also choose a reasonable bandwidth for the regression
kernel in ρ. Following extensive tests we adopt an adaptive bandwidth
enclosing the nearest 5000 points (also see discussion in Appendix A).
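
The transformation, its inverse, and the change of variables needed to plot a component that is Gaussian in W_Hα back in EW space can be sketched as below. The offset value follows the text; the function names are illustrative, and the Jacobian factor is the standard one for a base-10 logarithm rather than anything specific to the NMR code.

```python
import numpy as np

OFFSET = 1.4  # zero-offset parameter A adopted in the text

def ew_to_w(ew, offset=OFFSET):
    """Transform equivalent width to W = log10(EW + offset)."""
    ew = np.asarray(ew, dtype=float)
    if np.any(ew + offset <= 0.0):
        raise ValueError("EW + offset must be positive for the logarithm")
    return np.log10(ew + offset)

def w_to_ew(w, offset=OFFSET):
    """Inverse transformation: EW = 10**W - offset."""
    return 10.0 ** np.asarray(w, dtype=float) - offset

def gaussian_component_in_ew(ew, mu, sigma, offset=OFFSET):
    """Density in EW space of a component that is Gaussian (mu, sigma) in W space."""
    w = ew_to_w(ew, offset)
    gauss_w = np.exp(-0.5 * ((w - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    # Change of variables: |dW/dEW| = 1 / ((EW + offset) * ln 10).
    return gauss_w / ((np.asarray(ew, dtype=float) + offset) * np.log(10.0))

ew = np.array([-0.4, 0.0, 8.6])
w = ew_to_w(ew)
print(w)           # W values for the sample limit, zero EW and a strong line
print(w_to_ew(w))  # recovers the input EW values
```

Because of the Jacobian factor, a component that is symmetric in W_Hα appears skewed towards high values when viewed in EW space.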

Figure 3. The behaviour of the NMR components versus environment. The left
panel plots the data as dots, along with the location of each component,
indicated by thick, solid lines, and additionally their widths via the
coloured shading and dashed lines. These widths are shown explicitly in the
middle panel. The right panel displays the variation in the proportion of
each component. While the location and width of the components do not change
significantly with environment, the proportions vary strongly.

Figure 4. A three-dimensional view of the NMR estimate of the W_Hα-ρ_5.5
distribution, shown by the grey, transparent surface, and its constituent
components, colour-coded as in the previous figures. It can be clearly seen
that the positions and widths of the components do not change significantly,
while their relative proportions vary substantially.

We apply the NMR technique to the distribution of W_Hα, and determine the
optimum number of components using the Bayesian Information Criterion (BIC;
Schwarz 1978). Four components are strongly preferred by the data, by
ΔBIC > 7 (see Appendix A for more details). In Fig. 1 (Fig. B1) we show the
NMR components we obtain for the ρ_1.3 (ρ_5.5) sample, at two values of local
galaxy density. The components are plotted in W_Hα-space, in which the
technique is applied. We also show the components and data transformed back
into EW(Hα)-space in Fig. 2 (Fig. B2). The properties of these components as
a function of environmental density are shown in Fig. 3 (Fig. B3). In Fig. 4
we show a three-dimensional view of the components and their sum for the
ρ_5.5 sample, which includes all the relevant information (location, width
and relative proportion of each component) in a single plot. We show the
results for the ρ_5.5 sample here simply because they are smoother than those
for ρ_1.3, and the individual components are more clearly visible in this
three-dimensional view. It is critical to note that the only data which has
been used to determine these components is the EW(Hα) distribution.
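
Model selection with the BIC can be illustrated with a short sketch. It assumes the usual definition, BIC = -2 ln L + k ln n, and counts parameters as for an ordinary one-dimensional Gaussian mixture (3K - 1 for K components); the effective parameter count for NMR fits, where parameters vary with density, is discussed in Appendix A and may differ. The log-likelihood values below are invented purely for illustration and are not results from this work.

```python
import math

def bic(log_likelihood, n_params, n_points):
    """Bayesian Information Criterion (Schwarz 1978); lower values are preferred."""
    return -2.0 * log_likelihood + n_params * math.log(n_points)

def n_params_gaussian_mixture(k):
    """Free parameters of a 1-D mixture of k Gaussians: k means, k widths, k - 1 proportions."""
    return 3 * k - 1

# Hypothetical maximised log-likelihoods for fits with 3, 4 and 5 components.
n_points = 1000
log_likes = {3: -1520.0, 4: -1490.0, 5: -1488.0}

bics = {k: bic(ll, n_params_gaussian_mixture(k), n_points) for k, ll in log_likes.items()}
best_k = min(bics, key=bics.get)
runner_up = min(v for k, v in bics.items() if k != best_k)
delta_bic = runner_up - bics[best_k]
print(f"preferred number of components: {best_k}, delta BIC = {delta_bic:.1f}")
```

In this toy case the extra component in the five-component fit improves the likelihood too little to offset its parameter penalty, so four components are preferred.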

At this stage we make no attempt at interpreting the components as physically
distinct populations. Nevertheless, Figs. 1, B1, 2, B2 indicate that the
EW(Hα) distribution can be well described by multiple components. The
hypothesis that the galaxy population comprises distinct components, or
types, is strongly supported by the various property bimodalities described
earlier. We find that the locations and widths of the components of the
EW(Hα) distribution are independent of environment. Only the relative
proportions of the components are found to vary strongly. This implies that
the variations with environment are primarily the result of differences in
the relative frequency of each galaxy type, rather than changes in the
intrinsic properties of each type.

Galaxies move to regions of higher density over time, 
under the influence of gravity. The variation of galaxy prop- 
erties with environment is therefore at least partly due 
to environmentally-dependent changes in individual galaxy 
properties over time. If all galaxies in a given environ- 
ment were affected similarly, we would expect to see smooth 
changes in the property distributions of each individual com- 
ponent. However, we find that the individual components 
remain mostly unchanged with environment. This implies 
that some galaxies are transformed directly from one type to 
another, in an apparently stochastic manner. If this trans- 
formation is sufficiently slow, we would expect to see the 
transitioning galaxies appearing as a separate component in 
the relevant range of local density. If it is rapid, then the 
fraction of transitioning galaxies at any time would be too 
low to separate from the main distribution. 



6 IDENTIFYING THE COMPONENTS 

It is easy to identify the component at zero EW(Hα) with passive galaxies,
containing no star-formation or AGN activity. The dominant component at high
EW(Hα) must be associated with star-forming galaxies (with the above caveats
concerning potential AGN contamination). We also find two intermediate EW(Hα)
components. The principal change with environment appears to be the movement
of galaxies from the star-forming component to the others, but primarily to
the passive component. However, interpreting either of these intermediate EW
components as a population transitioning between star-forming and passive is
inconsistent with their existence as a significant fraction of the galaxy
population even at low environmental densities.

Figure 5. The BPT diagram for our ρ_1.3 sample, traditionally used to
identify star-forming galaxies and AGN hosts. For clarity, only one-fifth of
our sample galaxies are plotted. The LINER, Seyfert 2 and SF dominated
regions are colour-coded to match our interpretation of their correspondence
to the NMR components shown in the other figures (purple, blue and green,
respectively). Note that many galaxies cannot be placed on this diagram.
These are passive galaxies, with no emission lines, and uncertain galaxies,
with some detected emission lines, but not all four of those required for
inclusion in this diagram.

To explore the physical interpretation of the components we have found, we
now turn to more traditional diagnostics to separate the contributions from
star formation (SF) and AGN to the emission lines. The BPT diagram for our
ρ_1.3 sample is shown in Fig. 5. In order to appear on this plot, all four
required emission lines must be detected at > 2 sigma significance. The
classifications we define are as follows: (i) passive: no emission lines
detected; (ii) SF dominated: all four lines detected and below the curve of
Stasinska et al. (2006); (iii) AGN dominated: above the line of Kewley et al.
(2001) with either all four lines detected or with both lines for just one of
the ratios detected and [OIII]/Hβ > 0.6 or [NII]/Hα > 0.05; (iv) AGN+SF: all
four lines detected and between the curves of Kewley et al. (2001) and
Kauffmann et al. (2003); (v) SF+AGN: all four lines detected and between the
curves of Stasinska et al. (2006) and Kauffmann et al. (2003); (vi)
uncertain: at least one of the four emission lines detected, but none of the
other classification criteria met. Note that the majority of AGN-dominated
galaxies can be robustly identified simply from their [NII]/Hα ratio (Miller
et al. 2003; Stasinska et al. 2006).
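
A simplified version of the BPT part of this scheme can be sketched as follows. It uses only the widely quoted forms of the Kewley et al. (2001) and Kauffmann et al. (2003) demarcation curves in the log([NII]/Hα) versus log([OIII]/Hβ) plane. The Stasinska et al. (2006) curve, the line-detection requirements, the single-ratio criteria, and the passive and uncertain classes are all omitted, so galaxies below the Kauffmann et al. curve are not split into SF dominated and SF+AGN. The function names are illustrative.

```python
def kewley01(log_nii_ha):
    """Kewley et al. (2001) curve: log([OIII]/Hb) as a function of log([NII]/Ha)."""
    return 0.61 / (log_nii_ha - 0.47) + 1.19

def kauffmann03(log_nii_ha):
    """Kauffmann et al. (2003) curve: log([OIII]/Hb) as a function of log([NII]/Ha)."""
    return 0.61 / (log_nii_ha - 0.05) + 1.3

def classify_bpt(log_nii_ha, log_oiii_hb):
    """Classify a galaxy with all four lines detected using two demarcation curves."""
    x, y = log_nii_ha, log_oiii_hb
    # Each curve is only defined to the left of its vertical asymptote;
    # points to the right of an asymptote lie on the AGN side of that curve.
    if x < 0.05 and y < kauffmann03(x):
        return "SF dominated or SF+AGN"
    if x < 0.47 and y < kewley01(x):
        return "AGN+SF"
    return "AGN dominated"

for point in [(-0.5, -0.5), (-0.2, 0.0), (0.0, 1.0)]:
    print(point, classify_bpt(*point))
```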

Our classification method is such that galaxies classi- 
fied as AGN dominated must contain a significant AGN 
component, and will have a low contribution to their emission
lines from star formation. On the other hand SF dominated 
galaxies may well also contain up to ~ 20-40% AGN con- 
tamination in their emission lines (Kauffmann et al. 2003; 
Stasinska et al. 2006). The AGN dominated galaxies can be 
further subdivided into LINER and Seyfert 2 sources using 
the BPT diagram (Kauffmann et al. 2003). 

Figure 6 shows the W_Hα-ρ_1.3 distributions of galaxies classified using the
BPT diagram. Comparing with Fig. 3, one can clearly identify the NMR
components with the passive, LINER, Seyfert 2 and SF dominated BPT-classified
galaxies. The large fraction of galaxies for which the BPT diagram gives an
uncertain result may also be identified with the components. The galaxies
with apparently mixed star formation and AGN emission are found at similar
W_Hα to the Seyfert 2 objects, and the higher intermediate NMR component.
Galaxies with at least one emission line, but which cannot be identified via
the BPT diagram, have similar W_Hα to LINER objects and the lower
intermediate NMR component. While not conclusive, this strongly suggests that
the components derived from the NMR technique do represent physically
distinct populations. This is remarkable given that the NMR components have
been inferred from just a single emission line.

Figure 6. The W_Hα-ρ_1.3 distribution for objects in our sample colour-coded
by their location in the BPT diagram shown in Fig. 5. The lines indicate the
median W_Hα in bins of ρ_1.3 for each subsample. The left panel shows
passive, LINER, Seyfert 2 and SF dominated galaxies (in order of increasing
W_Hα), while the right panel shows uncertain, AGN+SF and SF+AGN galaxies
(brown, orange and cyan, respectively, and again in order of increasing
W_Hα). A comparison with Fig. 2 reveals a correspondence between the NMR
components and, in order of increasing W_Hα, (1) passive galaxies, (2) LINER
and uncertain galaxies, (3) Seyfert 2 and AGN+SF galaxies, and (4) SF
dominated and SF+AGN galaxies.



7 A NEW INSIGHT INTO THE GALAXY 
POPULATION 

By applying the newly developed NMR method to the Hα
equivalent width distribution, a single astrophysical quantity
that contains information on both star formation and
nuclear activity, we have identified four distinct components
in the galaxy population. None of these components varies
significantly with environment, in terms of the distribution
of its Hα equivalent widths. However, the relative proportions
of galaxies in each component vary substantially with
environment. This implies that any environmental processes
at work do not affect all galaxies in a gradual way, which
would result in changes in the component Hα equivalent
width distributions. Rather, they must rapidly transform
a fraction of galaxies from one component to another, in a
stochastic manner, in order to avoid changing the properties
of the individual components.

The above conclusions stand without requiring us to 
identify the components with more traditional galaxy sub- 
populations. However, when we attempt such an identifi- 
cation, we find that the extreme components may be as- 
sociated with passive and star-forming galaxies, while the 
two intermediate components display similarities to galax- 
ies hosting LINERs and Seyfert 2 AGN. Galaxies with an 
apparent mix of star-formation and AGN may also be iden- 
tified with these components. However, in contrast to the 
usual methods of classifying the star-formation and AGN 
properties of galaxies, which require multiple emission lines 
to be significantly detected, the technique we describe in 
this paper is applicable to all galaxies. We thereby avoid the 
issue of excluding objects for which traditional methods are 
uncertain, and the biases which this may introduce. 

APPENDIX A: NONPARAMETRIC MIXTURE
REGRESSION

This is a newly developed statistical method for determining
the dependence of one variable, $y$, on another, $x$, where
there may be multiple components present in the data, each
with a different $y$ on $x$ dependence. For the analysis presented
in the main body of this article we use this technique,
putting $x = \rho_{1.3}$ or $\rho_{5.5}$, estimates of the local environmental
density, and $y = W_{\mathrm{H}\alpha}$, a transformed version of the Hα
equivalent width (see Section 3). Here we give a technical
description of the method.

We model the probability, $f(y|x)$, of $y$ given $x$ as a sum
of components, thus

$$f(y|x; \Theta(x)) = \sum_{i=1}^{c(x)} \pi_i(x)\, g_i(y|\eta_i(x)), \tag{A1}$$

where the $g_i(y|\eta_i(x))$ are density functions with a vector of
parameters $\eta_i(x)$ that depends on $x$, and the $\pi_i(x)$ are a
set of mixing proportions that sum to one for each $x$. In this
paper we use Gaussian functions to model the components,
each with parameters $\eta_i = (\mu_i, \sigma_i)$, the mean and standard
deviation respectively. The number of components is $c(x)$, and
may vary as a function of $x$. Gaussians are rich and flexible
functions which are highly suited to this task, particularly
if one wishes to avoid the danger of overly designing the
method to fit one's expectations of the results.
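
As an illustration of Eqn. A1 at a single fixed value of $x$, the following minimal Python sketch evaluates a Gaussian mixture density. The function names and the four sets of component parameters mentioned afterwards are hypothetical and chosen only for demonstration; they are not fitted values from this work.

```python
import math

def gaussian_pdf(y, mu, sigma):
    """Gaussian density g(y | mu, sigma)."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def mixture_density(y, weights, means, sigmas):
    """Eqn. A1 at fixed x: sum over components of pi_i * g_i(y | mu_i, sigma_i)."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixing proportions must sum to one")
    return sum(w * gaussian_pdf(y, m, s)
               for w, m, s in zip(weights, means, sigmas))
```

For instance, `mixture_density(0.5, [0.4, 0.2, 0.15, 0.25], [-0.3, 0.2, 0.7, 1.3], [0.2, 0.2, 0.2, 0.25])` returns the density of an illustrative four-component model at y = 0.5. Because the mixing proportions sum to one, the mixture integrates to one over $y$.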

The parameter set, $\Theta(x) = (\theta_1(x), \ldots, \theta_{c(x)}(x)) =
(\pi_1(x), \eta_1(x), \ldots, \pi_{c(x)}(x), \eta_{c(x)}(x))'$, is determined using
local likelihood estimation. The parameters are approximated
locally by a polynomial of degree $p$, and hence vary smoothly
with $x$. The variation of the parameters can thus be described
by a set of polynomial coefficients, $B$. These coefficients
may then be constrained by data, weighted using a
kernel of bandwidth $b(x)$ about $x$.

Figure A1. Offsets in the Bayesian Information Criterion (BIC)
score versus local galaxy density, $\rho_{1.3}$, for NMR fits utilising 2,
3, and 5 components, relative to the favoured 4 component fit.
Where the 5 component fit BIC offset is zero at low $\rho_{1.3}$, the NMR
method only uses 4 of the 5 available components as two of the
components are degenerate. Four components are thus preferred,
by significantly higher BIC values, at all local densities.

The log-likelihood function of the set of polynomial coefficients
$B$ given the data is therefore

$$\mathcal{L}_p(B; x, b, c(x)) = \sum_{m=1}^{n} w_m(x; b) \times \log_e f(Y_m|x; T(X_m - x, B)), \tag{A2}$$

for $n$ measurements labelled by $m$, with locations $(x, y) =
(X_m, Y_m)$. The set of polynomial functions approximating
the parameters $\Theta$ at $x$ is

$$T(\delta_m, B) = (t_{1,1}(\delta_m, \beta_{1,1}), \ldots, t_{1,q_1}(\delta_m, \beta_{1,q_1}), \ldots, t_{c(x),1}(\delta_m, \beta_{c(x),1}), \ldots, t_{c(x),q_{c(x)}}(\delta_m, \beta_{c(x),q_{c(x)}})), \tag{A3}$$

defining $\delta_m = X_m - x$, with

$$t_{i,j}(\delta_m, \beta_{i,j}) = \sum_{k=0}^{p} \beta_{i,j,k}\, \delta_m^{k}, \tag{A4}$$

where $i = 1, \ldots, c(x)$ counts over the components, $j =
1, \ldots, q_i$ counts the parameters of component $i$ (in our case
each density function is a Gaussian with parameters $\mu$ and
$\sigma$, and with mixing weight $\pi$, thus $q_i = 3$), and $k = 0, \ldots, p$
counts the degrees of the polynomials used in $T$ to approximate
the parameters $\Theta$. The $\beta_{i,j,k}$, and hence their containing
sets, $\beta_{i,j}$ and $B$, correspond to a particular value of $x$.
Note that the $\beta_{i,j,k}$ give approximations around $\delta_m = 0$ for
the value and $k$-th derivative of the parameter $j$ of component
$i$. The contribution to $\mathcal{L}_p$ of data at distance $\delta_m$ from
$x$ is specified by

$$w_m(x; b(x)) = W\left(\frac{X_m - x}{b(x)}\right), \tag{A5}$$

where $W(z)$ is a weighting function.
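
The kernel weighting of Eqn. A5 can be sketched as follows. This is a minimal Python illustration under two assumptions that are not stated in the text: the weighting function $W(z)$ is taken to be a tricube kernel, and the bandwidth $b(x)$ is taken to be the distance from $x$ to its $K$th nearest data point, one possible reading of the nearest-neighbour bandwidth adopted later in this appendix. Function names are hypothetical.

```python
def tricube(z):
    """Assumed form of W(z): peaked at z = 0 and zero for |z| >= 1."""
    a = abs(z)
    return (1.0 - a ** 3) ** 3 if a < 1.0 else 0.0

def knn_bandwidth(x, xs, k):
    """Bandwidth b(x), taken as the distance from x to its k-th nearest datum."""
    if not 1 <= k <= len(xs):
        raise ValueError("k must lie between 1 and the number of data points")
    return sorted(abs(xm - x) for xm in xs)[k - 1]

def kernel_weights(x, xs, k):
    """Weights w_m(x; b(x)) = W((X_m - x) / b(x)) for every datum X_m."""
    b = knn_bandwidth(x, xs, k)
    if b == 0.0:
        raise ValueError("bandwidth is zero; increase k")
    return [tricube((xm - x) / b) for xm in xs]
```

With this choice the bandwidth widens automatically where data are sparse, so each local fit draws on a comparable number of galaxies. The $K$th neighbour itself receives zero weight because the tricube kernel vanishes at $|z| = 1$.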

One can then attempt to determine the $B$ which
maximises the local log-likelihood, $\mathcal{L}_p$, which we denote
$\hat{B}(x; b(x), c(x))$, explicitly indicating its dependencies.
Therefore,

$$\hat{B}(x; b(x), c(x)) = \underset{B}{\arg\max} \sum_{j=1}^{n} w_j(x; b(x)) \log_e \sum_{i=1}^{c(x)} t_{i,1}(X_j - x, \beta_{i,1})\, g_i(Y_j | t_{i,2}(X_j - x, \beta_{i,2}), \ldots, t_{i,q_i}(X_j - x, \beta_{i,q_i})). \tag{A6}$$

The local likelihood estimate for the set of parameters is
then defined by $\hat{\Theta}(x; b(x), c(x)) = T(0, \hat{B}(x; b(x), c(x)))$,
that is $\hat{\Theta}_{i,j}(x; b(x), c(x)) = \hat{\beta}_{i,j,0}(x; b(x), c(x))$. Our conditional
density estimate given $b(x)$ and $c(x)$ is therefore

$$\hat{f}(y|x; b(x), c(x)) = f(y|x; \hat{\Theta}(x; b(x), c(x))). \tag{A7}$$

In general, given $b(x)$ and $c(x)$, the standard method of solving
Eqn. A6 is to use the Expectation-Maximisation (EM)
method (McLachlan & Krishnan 1997).
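
To make the EM solution concrete, the following Python sketch fits the mixture at a single location $x_0$ in the simplest case of local-constant parameters ($p = 0$), for which each EM step reduces to kernel-weighted updates of the mixing proportions, means and standard deviations. It is an illustration, not the implementation used in this work: the tricube kernel, the fixed bandwidth, the fixed iteration count and the toy data generator are all assumptions, and every function name is hypothetical.

```python
import math
import random

def gauss(y, mu, sigma):
    """Gaussian density g(y | mu, sigma)."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def tricube(z):
    """Assumed weighting function W(z); the text does not specify one."""
    a = abs(z)
    return (1.0 - a ** 3) ** 3 if a < 1.0 else 0.0

def local_em(x0, xs, ys, b, init, n_iter=100):
    """Local-constant (p = 0) weighted EM at x0.

    init is a list of (pi, mu, sigma) starting values, one per component.
    Returns the fitted list of (pi, mu, sigma) at x0.
    """
    # Kernel weights w_m of Eqn. A5; points outside the kernel are dropped.
    data = [(tricube((xm - x0) / b), ym) for xm, ym in zip(xs, ys)]
    data = [(w, y) for w, y in data if w > 0.0]
    wsum = sum(w for w, _ in data)
    comps = list(init)
    for _ in range(n_iter):
        # E-step: accumulate kernel-weighted responsibilities and moments.
        acc = [[0.0, 0.0, 0.0] for _ in comps]
        for w, y in data:
            dens = [p * gauss(y, mu, s) for p, mu, s in comps]
            total = sum(dens) or 1e-300  # guard against underflow
            for a, d in zip(acc, dens):
                r = w * d / total
                a[0] += r
                a[1] += r * y
                a[2] += r * y * y
        # M-step: weighted updates of pi, mu and sigma.
        new = []
        for (n, sy, syy), old in zip(acc, comps):
            if n <= 0.0:
                new.append(old)  # empty component: keep previous values
                continue
            mu = sy / n
            var = max(syy / n - mu * mu, 1e-12)
            new.append((n / wsum, mu, math.sqrt(var)))
        comps = new
    return comps

def simulate(n, seed=0):
    """Toy data: two fixed Gaussians whose proportions change with x."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        x = rng.random()
        high = rng.random() < 0.2 + 0.6 * x
        xs.append(x)
        ys.append(rng.gauss(2.0 if high else 0.0, 0.3))
    return xs, ys
```

Calling `local_em` at a grid of $x_0$ values traces each parameter as a function of $x$. On the toy data, the fitted means and widths stay essentially fixed while the mixing proportions change with $x$, which is the qualitative behaviour reported for the real components in the main text.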

The estimator Eqn. A7 is dependent upon the chosen 
bandwidth b(x) and number of components c(x). If they are 
a priori unknown, we must therefore select them in some 
reliable way. 

In this work we have chosen the bandwidth for $x = \rho_{1.3}$
or $\rho_{5.5}$ to be a function of the $K$th nearest neighbour. We use
$K = 5000$, selected as a compromise between the smoothness
of the resulting component regression lines and their
ability to trace any variation in $W_{\mathrm{H}\alpha}$ versus environment.
We have checked that the exact choice of $K$ (within the
range 1000-7500) does not affect our results. The optimum
number of components was determined using the Bayesian
Information Criterion (BIC; Schwarz 1978):

$$\mathrm{BIC} = \mathcal{L}_p - \frac{1}{2}(3c - 1)\log_e(K), \tag{A8}$$

where $\mathcal{L}_p$ is the maximised log-likelihood, $c$ is the number
of components, and $K$ is the sample size. With this definition,
otherwise known as the Schwarz Criterion, the preferred
model is that which maximises the value of BIC. Note
that other definitions sometimes multiply the right hand
side of Eqn. A8 by $-2$. The difference between the BIC
values of two models, ΔBIC, approximates the natural logarithm
of the Bayes factor, a summary of the evidence for
one model over another. A ΔBIC of 7 indicates that the preferred
model is truly better than the alternative model with
odds better than a thousand to one. A Bayes factor of > 150,
i.e. ΔBIC > 5, is generally taken to be very strong evidence
for the preferred model (Kass & Raftery 1995). Four components
are thus very strongly favoured, by ΔBIC = 147.1,
11.9 and 7.7 versus 2, 3 and 5 components, respectively, averaged
over $\log_{10} \rho_{1.3}$. The ΔBIC are shown versus $\rho_{1.3}$ in
Fig. A1.
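
The model comparison described above can be reproduced in outline with a few lines of Python. The sketch below implements Eqn. A8 and converts a ΔBIC into the approximate Bayes factor it implies. The function names are hypothetical, and any log-likelihood values passed in would need to come from actual NMR fits; none are taken from this work.

```python
import math

def bic(log_like, n_components, sample_size):
    """Eqn. A8: BIC = L - (1/2) (3c - 1) ln K, where larger is better."""
    n_free = 3 * n_components - 1  # c means, c widths, c - 1 free proportions
    return log_like - 0.5 * n_free * math.log(sample_size)

def preferred_components(log_likes, sample_size):
    """Given {c: maximised log-likelihood}, return (best c, {c: delta BIC})."""
    scores = {c: bic(ll, c, sample_size) for c, ll in log_likes.items()}
    best = max(scores, key=scores.get)
    return best, {c: scores[best] - s for c, s in scores.items()}

def bayes_factor(delta_bic):
    """Approximate Bayes factor implied by a BIC difference."""
    return math.exp(delta_bic)
```

Under this approximation `bayes_factor(7)` is about 1100, consistent with the odds of better than a thousand to one quoted above, and `bayes_factor(5)` is about 148, close to the threshold of 150 for very strong evidence.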

One might argue that choosing different density functions,
other than Gaussians, or applying a different transformation,
would result in our finding a different optimal
number of components. However, when varying the $W_{\mathrm{H}\alpha} =
\log_{10}(\mathrm{EW(H\alpha)} + A)$ transformation by changing $A$, and trying
various combinations of Gaussians and lognormal functions
in EW(Hα)-space, the optimum number of components
has consistently turned out to be four. A careful visual
inspection of the EW(Hα) and $W_{\mathrm{H}\alpha}$ distributions also supports
this conclusion.

Obviously one could examine the data and devise com- 
ponent density functions that would result in the NMR 
method finding any desired number of components. How- 
ever, this defeats the object of employing the NMR tech- 
nique. By 'components' we mean simple, distinct elements of 
the overall population. We must therefore make only simple 
assumptions and transformations in order to identify them, 
with minimal prior reference to the data. 

If two or more NMR components together represent 
only a single true component of the galaxy populations, then 
we would expect them to behave identically. Otherwise, they 
could not represent a single component, by definition. How- 
ever, our four NMR components each demonstrate different 
behaviour with respect to local density, indicating they are 
truly distinct (see Figs. 1-3, B1-B3). 

Finally, the components we find using the NMR tech- 
nique correspond remarkably well to traditional galaxy clas- 
sifications (compare Figs. 3 & 6). This strongly supports our 
interpretation of the NMR components as physically distinct 
elements of the galaxy population. However, the NMR com- 
ponents have the advantage of being based on all galaxies 
in our sample. Traditional diagnostic diagrams can only be 
used for objects with multiple, significantly-detected, emis- 
sion lines, and in many cases give ambiguous classifications 
(e.g. SF+AGN). 



APPENDIX B: LARGER SCALE 
ENVIRONMENT 

The main text of the paper focuses on a measure of envi- 
ronment using a kernel of bandwidth 1.3 Mpc, chosen by 
cross-validation. This bandwidth performs well at the scales 
of galaxy clusters. However, at low densities there is fre- 
quently only one galaxy within the kernel, and the estimator 
is unable to differentiate between different low-density envi- 
ronments. We thus additionally perform our analysis using 
local densities estimated using a kernel with 5.5 Mpc band- 
width. The results are very similar to those from the 1.3 Mpc 
densities, and thus our conclusions are robust to the precise 
definition of local density. The figures corresponding to the 
5.5 Mpc kernel are given in this appendix. 

Figure B1. As Fig. 1, but for local galaxy densities estimated using a 5.5 Mpc bandwidth kernel, $\rho_{5.5}$. The results are very similar to
those found using $\rho_{1.3}$.

Figure B2. As Fig. B1, but shown in terms of the untransformed equivalent width, EW(Hα). The inset shows the same plot with
axis-ranges chosen to better show the behaviour at small EW(Hα).

Figure B3. As Fig. 3, but for local galaxy densities estimated using a 5.5 Mpc bandwidth kernel, $\rho_{5.5}$. The results are very similar to
those found using $\rho_{1.3}$.